Techniques for Reliability and Fault Tolerance in OpenMP
Authors
Abstract
Mission-critical applications, such as those used in flight traffic control systems or in reactive systems, must execute reliably to preserve the integrity of their operating environment and to ensure the correct course of operations. The design of these systems must account for reliability without neglecting performance. Techniques for adding reliability to a computation require redundant processors and memory banks. In most cases the target architectures are based on processors of modest computing power and use special-purpose coprocessors for heavy computations, because the principal task in most applications is to control a flow of events arriving from the external environment. When the main goal is high performance together with reliability, with few reactivity demands, the host architecture can be a general-purpose parallel architecture; a shared-memory multiprocessor is the most suitable. OpenMP [1] has recently emerged as an industry-standard interface for high-level programming of shared-memory parallel architectures. With OpenMP, applications can be written in a shared-memory programming model and remain portable across a wide range of parallel computers. This paper explores the possibility of automatically producing reliable OpenMP code from OpenMP code augmented with suitable directives that express the need for reliability. We propose a technique and a tool that add to the parallel functionality offered by OpenMP compilers new capabilities for automatically obtaining reliable OpenMP codes, or reliable code portions within an OpenMP code. We show how these reliability capabilities can be obtained by defining suitable translation rules, which can be implemented in a source-to-source translator that automatically generates reliable, pure OpenMP-compliant code from reliability-annotated OpenMP code.
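The abstract does not give the directive syntax or the translation rules, so the following is only a minimal sketch of what a source-to-source rewrite of this kind could look like. It assumes a hypothetical `#pragma omp reliable` annotation and swaps in a well-known redundancy scheme (three replicas plus a majority vote) in place of the paper's actual rules. The names `sum_of_squares`, `vote3`, and `reliable_sum_of_squares` are illustrative and do not come from the paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical annotated input (the directive name is an assumption,
 * not the paper's syntax):
 *
 *     #pragma omp reliable
 *     result = sum_of_squares(data, n);
 */

/* The computation to protect: a pure function of its inputs. */
static long sum_of_squares(const int *data, size_t n)
{
    long acc = 0;
    for (size_t i = 0; i < n; ++i)
        acc += (long)data[i] * data[i];
    return acc;
}

/* Majority vote over three replica results.
 * Returns 0 and stores the agreed value in *out when at least two
 * replicas agree; returns -1 when all three differ. */
static int vote3(long a, long b, long c, long *out)
{
    assert(out != NULL);
    if (a == b || a == c) { *out = a; return 0; }
    if (b == c)           { *out = b; return 0; }
    return -1;
}

/* One possible translated form using only standard OpenMP constructs:
 * each replica runs in its own section and writes its own variable,
 * and the vote happens after the implicit barrier of the region. */
static int reliable_sum_of_squares(const int *data, size_t n, long *out)
{
    long r0 = 0, r1 = 0, r2 = 0;

    #pragma omp parallel sections
    {
        #pragma omp section
        r0 = sum_of_squares(data, n);

        #pragma omp section
        r1 = sum_of_squares(data, n);

        #pragma omp section
        r2 = sum_of_squares(data, n);
    }

    return vote3(r0, r1, r2, out);
}
```

A caller would invoke `reliable_sum_of_squares(data, n, &result)` and treat a return value of -1 as an unrecoverable disagreement among the replicas. Because the generated code uses only standard pragmas, it also compiles and runs serially when OpenMP support is disabled.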
Similar resources
An approach to fault detection and correction in design of systems using of Turbo codes
We present an approach to the design of fault-tolerant computing systems. In this paper, a technique is employed that enables the combination of several codes in order to obtain flexibility in the design of error-correcting codes. Code-combining techniques are very effective, and turbo codes are one such code. The Algorithm-based fault tolerance techniques that to detect errors rely on the c...
Design, Testing, and Evaluation Techniques for Software Reliability Engineering
Software reliability is closely influenced by the creation, manifestation and impact of software faults. Consequently, software reliability can be improved by treating software faults properly, using techniques of fault tolerance, fault removal, and fault prediction. Fault tolerance techniques achieve the design for reliability, fault removal techniques achieve the testing for reliability, and ...
Stability Assessment Metamorphic Approach (SAMA) for Effective Scheduling based on Fault Tolerance in Computational Grid
Grid computing allows coordinated and controlled resource sharing and problem solving in multi-institutional, dynamic virtual organizations. Moreover, fault tolerance and task scheduling are important issues for large-scale computational grids because of the unreliable nature of grid resources. A commonly exploited technique to realize fault tolerance is periodic checkpointing that periodically ...
Reliability and Performance Evaluation of Fault-aware Routing Methods for Network-on-Chip Architectures (RESEARCH NOTE)
Nowadays, faults and failures are increasing especially in complex systems such as Network-on-Chip (NoC) based Systems-on-a-Chip due to the increasing susceptibility and decreasing feature sizes. On the other hand, fault-tolerant routing algorithms have an evident effect on tolerating permanent faults and improving the reliability of a Network-on-Chip based system. This paper presents reliabili...
Extending an Application-Level Checkpointing Tool to Provide Fault Tolerance Support to OpenMP Applications
Despite the increasing popularity of shared-memory systems, there is a lack of tools for providing fault tolerance support to shared-memory applications. CPPC (ComPiler for Portable Checkpointing) is an application-level checkpointing tool focused on the insertion of fault tolerance into long-running MPI applications. This paper presents an extension to CPPC to allow the checkpointing of OpenMP...
Publication date: 2001